Author
|
Topic: Validated Polygraph Techniques
|
blalock Member
|
posted 10-23-2006 06:58 PM
First, let me say how much I enjoyed reading the latest edition of POLYGRAPH. As I read President Krapohl's article regarding Validated Polygraph Techniques, I had a few questions that I am hoping this forum could expound upon for me. On page 152, column 1, and again on page 153, column 2, bullet 2, there are references to human scoring versus computer algorithm scoring. Page 150, column 2, paragraph one notes that the accuracy figures are based on human decisions rather than algorithm decisions.

Now, if, for example, the Utah ZCT has an accuracy rate of 91% (without inconclusives) using human decisions, QUESTION ONE: How much better is the Utah ZCT with algorithm decisions?

Second, page 153, column 1, bullet 3 refers to "two-stage decision rules." QUESTION TWO: What are "two-stage decision rules?"

Third, the next bullet down (bullet 4) refers to the "total chart minute concept." QUESTION THREE: What is the "total chart minute concept?" I seem to remember something about how, after a certain number of charts, further charts detract from accuracy.

Ben IP: Logged |
Barry C Member
|
posted 10-24-2006 07:53 AM
I haven't received my copy yet, but now I'm waiting for the mailman to arrive. Anyhow, I can answer a couple of your questions:

Two-stage decision rules mean just that: two stages. First there is a standard total score cut-off, such as +/-6. That's stage one. If your test is INC after stage one, then go to stage two, which would look at spot scores, e.g., a -3 in any spot (regardless of the total score) would result in a DI call. Utah rules, for example, look only at total scores. Even if you have a -3 in one spot, if the total score is +6 or greater, the call is NDI. The same is true with the Evidentiary Scoring Rules Don Krapohl and I wrote about in a recent journal. However, with Evidentiary Scoring Rules, a two-stage system is used. That is, if a total score falls in the INC range (-5 to +3), then spot scores are considered. What we find is that if we look at both at the same time (total scores and spot scores), we end up with more NDI calls that are wrong or INC. (We do gain more correct DI calls, but at that expense.)

The Total Chart Minutes Concept is a Backster teaching that has to do with habituation over a number of charts. Essentially, it says that a person is firing hard over a limited period of time, and as time goes on, the person becomes less responsive, essentially to the point of not being worthwhile. And each channel has a different peak and valley. The data are a bit contradictory, but it appears habituation can occur within charts, though typically not across charts. Krapohl battled Backster and Matte on this one a few years back in POLYGRAPH. It's an interesting read, but I wouldn't put it on the essential reading list. True or not, you're going to do what you do anyhow. If a person's out of juice, he's out of juice, and everybody's different. (I've heard some say dogmatically you can run 14 charts on any one person, and it stems from this concept.)

I don't know what you mean by question one. Perhaps after I read it I will? IP: Logged |
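To make the two-stage idea concrete, here is a minimal sketch in Python. The function name and the cutoffs (+/-6 total, -3 spot) are only illustrations of the rules described in the post above; actual field decision rules vary by technique and scoring system.

```python
def two_stage_decision(spot_scores, total_cut=6, spot_cut=-3):
    """Illustrative two-stage decision rule (not an official standard).

    Stage one: classify on the grand total alone.
    Stage two: only if stage one is inconclusive, consult the spot scores.
    """
    total = sum(spot_scores)

    # Stage one: grand total only.
    if total >= total_cut:
        return "NDI"
    if total <= -total_cut:
        return "DI"

    # Stage two: total was inconclusive, so look at individual spots.
    if any(score <= spot_cut for score in spot_scores):
        return "DI"
    return "INC"


# Example: a +4 total is INC at stage one, but the -3 spot yields DI at stage two.
print(two_stage_decision([5, 2, -3]))   # DI
print(two_stage_decision([2, 1, 1]))    # INC (no cutoff reached at either stage)
print(two_stage_decision([3, 2, 1]))    # NDI (total reaches +6 at stage one)
```

Under a single-stage total-score rule like the Utah example above, the first case would simply remain inconclusive, which is exactly the trade-off described in the post.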
Barry C Member
|
posted 10-30-2006 04:14 PM
Okay. I got it today. (Must have been pony express?) Question one is tough to answer since there are so many variables involved, but here goes: When we're talking the Utah test, we're talking the CPS scoring algorithm - or at least we should be. (The CPS algorithm was developed by Drs. Raskin and Kircher, the designers of the CPS system made by Stoelting.) The short answer is that the algorithm is at least as accurate as an examiner who scores charts well. It outperforms blind scorers. So when you ask, "How much better...?" what do you really mean? This is a good question for our resident statistician. If CPS is 90% accurate (and that's a low, but fair figure) and an examiner is that accurate as well, what have we gained (i.e., should we be more confident)? If the examiner is 86% accurate (as we see in the NSA study), then we've gained a little ground. If he's one of the 50% who doesn't achieve the 86% mark, then a lot can be gained. So Ray, at what point would we consider it "better," and more importantly, why? IP: Logged |
rnelson Member
|
posted 10-31-2006 12:00 AM
Barry, I certainly don't pretend to have all the answers... just .02 worth of commentary. I think someone else on this board - perhaps J.B. - pointed out that "better" is also a matter of co$t - as in, sometimes slight improvements are not worth huge costs. The recent Stern report may be an example of this - the best thing to do may have as much to do with long-term and overall costs as with present technology. I think you said it correctly in a way...
quote: If CPS is 90% accurate (and that's a low, but fair figure) and an examiner is that accurate as well, what have we gained (i.e., should we be more confident)? If the examiner is 86% accurate (as we see in the NSA study), then we've gained a little ground. If he's one of the 50% who doesn't achieve the 86% mark, then a lot can be gained.
We of course don't know whether we are among those examiners who achieve 86% or better. This reminds me of the one fundamental constant that all persons everywhere have in common... that is, deep down inside, we all consider ourselves to be above average in driving skills. All professionals have vested interests in assuming they are good at their job, even if they are not. So one thing we gain is a form of field QC - if we are inconsistent with our algorithm in ways that we cannot account for or point to in the charts, then there may be room for learning or improvement. I don't mind seeing an examiner disagree with a computer score, when we can identify in the charts why the algorithm wants to treat a tracing segment differently than the examiner. That is one of the problems with existing computer scoring tools. Stoelting seems to do better than some, but most don't tell us what features and data points they employ. And for this...
quote: So Ray, at what point would we consider it "better," and more importantly, why?
Barry, in terms of accuracy, better is a matter of two things, the first of which is reliability. Will it be done the same way every time? Imagine trying to improve on something like pistol or rifle marksmanship if the user/shooter makes spurious changes in grip, ballistics, and sights with every shot. It is sometimes necessary to reproduce a missed shot in order to evaluate all the variables that contribute to it. Reproducing an error involves doing it the same way again. Then a thoughtful change can be made to one parameter at a time: grip, sight picture, windage/elevation, ballistics, etc.
So, standardizing seems to be a good thing. In the absence of technology, we base standards on the face-valid opinions of "experts" - what my pinheaded son's debate class calls "appeal to authority" (the big kahunas said so). That is entirely valid, because the absence of research doesn't relieve administrators and policy-makers from making decisions. Consider traffic lights and crosswalks - who had research on them when the first ones were built? They have face validity - it makes sense to take turns stopping and going at intersections. On the other hand, it doesn't make sense to remain stagnant at that point, or to fail to modify our methods in response to data. Crosswalks alone do not seem to be correlated with reduced pedestrian-vehicle accidents. Additionally, there is some research that says the simple design of roadways may have a more powerful effect on traffic safety than signs and lights that direct speed and turn-taking at intersections. So now we see more cities building things like roundabouts, which effectively control speed and turn-taking with fewer lights and signs. Safety at intersections may have as much to do with caution and paying attention as with stopping and going in the right order. Decorative roundabouts and curving roadways prompt increased caution and attention to surroundings - the difference between driving directly towards a vehicle that might move out of the way, and driving directly towards a pile of decorative concrete that will not.

Effects are sometimes subtle and sometimes obvious. In my area, the county effectively slowed traffic on a narrow older street which could not be widened, by slightly narrowing the roadway and by providing non-continuous parking regions in front of businesses. The street now seems to vary continuously in width, and the whole road speaks to going slower - it would be uncomfortable to go faster, and when people are in a hurry to get to work, they don't tend to think about that road anymore. On the other hand, the wide streets of some towns seem like they are designed for the kind of fun that exceeds the posted legal limit. My point is that the tools and resources we are provided have just as important an influence on behavior and activity as training does. This may be as true with polygraph tools as with our ultimate driving machines.

You can see other examples of how our technology influences behavior in polygraph. I've objected to the folks at Limestone including two pneumo scores in their hand-scoring scratch-pad, as that seems to encourage scoring the pneumos separately - effectively overweighting the pneumos. Another example of tools influencing activities occurred in this case (http://www.raymondnelson.us/qc/060406.html), in which another examiner had previously cleared the subject of the allegations (he later confessed). The examiner used a hand-scoring sheet with a footer that said "American Polygraph Association, (2004), Keith Hedges" - apparently obtained in a training presentation. The score sheet has lines for both upper and lower pneumo scores, and the examiner added both into the score - effectively weighting the pneumos at up to 50 percent of the total possible score. But that footer sure makes it look like the examiner uses "official" scoring tools. While standardization and improved reliability/repeatability are important, that alone does not guarantee accuracy.
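A quick arithmetic sketch of the weighting effect just described, assuming the common three-channel scoring scheme (pneumo, EDA, cardio) with each recorded channel score carrying equal weight; the numbers are illustrative only.

```python
# With one combined pneumo score, respiration is 1 of 3 channel scores per spot.
channels_standard = ["pneumo (combined)", "EDA", "cardio"]
print(1 / len(channels_standard))   # ~0.33 of the total possible score

# Adding upper and lower pneumo scores separately makes respiration 2 of 4
# channel scores per spot -- up to 50 percent of the total possible score.
channels_doubled = ["upper pneumo", "lower pneumo", "EDA", "cardio"]
print(2 / len(channels_doubled))    # 0.50 of the total possible score
```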
There is a consistent trend in the sciences toward simplicity or parsimony - simplicity seems to underlie both reliability/repeatability and generalizability/cross-validation. You can see this in polygraph in that our most robust and objective scoring systems have the simplest measurement paradigms. This is consistent with other sciences. While complex evaluation schemes can very accurately describe a development sample, they sometimes don't do as well on cross-validation as simpler schemes. (I can't think of a good example right now.) Better is also not simply a matter of hit-rate accuracy, but whether our methods satisfy common empirical considerations, such as describability in terms of scientific theories and phenomena that are recognizable to sister sciences - psychology, physiology, inferential statistics, testing and diagnosis, and decision theory - and whether we can estimate and account for the probability of error, as Daubert requires. Well-described computerized algorithms can help with this.

The discussion about total chart minutes is a good example of polygraph science not completely incorporating information from sister sciences into our argument and rhetoric, and I agree with Barry's take on the matter. It seems like I read a study, maybe Dollins (sometime close to 1998 or 2000), about habituation and CIT or GKT tests. Stu Senter at APA 2002(?) talked about this and the fact that diagnostic response was observable at five or 20 some-odd charts. That's important to our understanding of habituation - which is not adrenal exhaustion. When Barry says there does not appear to be habituation across charts, I think he actually speaks to the phenomenon of dishabituation (which is distinct from sensitization), or the regaining of a previously habituated response potential. All kinds of things can dishabituate a subject's habituated response - distraction, confrontation, laughter (though we generally don't make jokes during a polygraph test), change of topic, or the simple physical activity of changing EDA sensors. I think what is not emphasized enough is that habituation effects may be quite different in CQT and GKT test situations, so generalizing the usability of response data from GKT research to CQT tests is something to do only with great caution.

I'm not sure I've answered any question, but I am sure I rambled a bit. The empirical answer to 'how do we know which is better' is through data and description. Someone does a study and describes the data; then we can compare that to other studies that describe the data outcomes of other methods (some data outcomes are cost effects). Ideally we don't have just one or two studies but a series of studies that build a scaffold of knowledge on which someone could do meta-analysis (studying our studies) to make better sense of the big picture. The NRC/NAS report was a form of this. So Barry, your question,
quote: at what point would we consider it "better," and more importantly, why?
We would not assume betterness based on just one study - we would prefer several studies, repeatable betterness, based on larger samples that more closely approximate the population of examiners in different field situations - criminal investigation, LEPET, PCSOT, security screening, fishing, bodybuilding, fidelity (and of course reality television and radio pranks). We also expect convergence of new information with existing knowledge about psychophysiology and polygraph.

In the absence of access to larger amounts of representative data on hand-scoring and algorithm scoring, we could use resampling or Monte Carlo methods to randomly create a distribution of sample distributions. With a sample of, say, N=50, you could create a resampled distribution of 100 random samples of 30 or so cases each - drawn randomly from the sample of 50. Then we compute means and deviations to describe the distributions of each of those resamples, and construct another mean and deviation score from the means of those resamples. Argh. These methods have been found to produce sample distributions that can quite closely approximate the population from which the original sample was drawn. It's basically like using a small sample to mimic a larger sample. I haven't yet seen or heard of this in polygraph, but it's common in other fields where it is sometimes difficult to gain enough sample data that is representative of the larger population.

So, I know it's not a completely satisfying answer, but better is what makes us work smarter, and what helps us understand our science and our objectives more accurately. I don't think anyone is going to pull a rabbit out of a hat and magically fix all the problems we face. It's going to be a gradual process of scaffolding new knowledge, techniques, and instrumentation onto existing methods - just like every other field of science. r

------------------ "Gentlemen, you can't fight in here, this is the war room." --(from Dr. Strangelove, 1964)
[This message has been edited by rnelson (edited 10-31-2006).] IP: Logged |
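A minimal sketch of the resampling idea rnelson describes, written in Python with numpy. The scores are made up, the draws are taken with replacement (as in a standard bootstrap), and the counts (N=50, 100 resamples of about 30 cases each) are just the illustrative numbers from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample of N=50 total chart scores (synthetic, for illustration).
sample = rng.normal(loc=4.0, scale=6.0, size=50)

n_resamples = 100   # number of random resamples
subsample_n = 30    # cases drawn per resample

resample_means = np.empty(n_resamples)
resample_sds = np.empty(n_resamples)
for i in range(n_resamples):
    draw = rng.choice(sample, size=subsample_n, replace=True)
    resample_means[i] = draw.mean()
    resample_sds[i] = draw.std(ddof=1)

# The mean and deviation of the resampled means describe the
# "distribution of sample distributions" built from the small sample.
print("mean of resample means:", resample_means.mean())
print("SD of resample means:  ", resample_means.std(ddof=1))
```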
Barry C Member
|
posted 10-31-2006 09:07 AM
How do you write so coherently at that hour? Ben's question was, "How much better is the Utah ZCT with algorithm decisions?" You've answered the question very well (as I expected you would), but I think the question that is still lurking is whether an examiner's decision (if we know how well he scores charts) combined with an algorithm decision that is, say, on average, 90% accurate, is "better" from a statistical standpoint. The question springs from this: If two examiners score a chart as NDI, how should that affect confidence in the result? What if a person is tested twice by two different examiners on the same issue and the result is two NDI decisions? I've heard the question now that paired testing is out there, but that is very different in that we have two people on opposite sides of an issue being tested by two different examiners with known error rates. From that we can calculate error rates and have more confidence, because we know the base rate is (or should be, anyhow) 50%. I think that's where this is coming from, and even if it's not, it's becoming a more common question, and the answer is one that must come from an understanding of statistics (almost a four-letter word?). IP: Logged |
rnelson Member
|
posted 10-31-2006 09:27 AM
OK, here is the short, non-math version. If the odds of an examiner error are low, and the odds of an algorithm error are also low, there is some simple math that tells us that the odds of both being wrong are quite a bit lower than either alone. So by that logic, I live in the safest home in Denver - what are the odds of the same house burning down twice? By the same logic, you should also carry a bomb on a plane - because the odds of there being a bomb on a plane are quite small, but the odds of there being two bombs on a plane are infinitesimal. So, OK, it makes sense, but there may be some practical flaw. r
------------------ "Gentlemen, you can't fight in here, this is the war room." --(from Dr. Strangelove, 1964) IP: Logged |
J.B. McCloughan Administrator
|
posted 10-31-2006 10:07 AM
Barry, I think one big variable is missing in this discussion of the potential "accuracy" of an algorithm compared to a human examiner, and that is intelligence. At this time, it is still necessary for an examiner to inspect the data to ensure that it is of sufficient quality for an algorithm to score. An algorithm cannot discern between a response, a movement, and/or a countermeasure very well (think of a dumbwaiter). In addition, a problem with algorithm validation is that much of the confirmed data pool available has been utilized to program said algorithms. It would be logical to say that an algorithm is much better at achieving reliability when repeat data is presented to it. If that data was used to program the algorithm, then it will repeatedly make the correct decisions. This can lead to an inflated accuracy figure. With field validation it faces the same problems as human decisions. At such time as an algorithm can navigate the complexity of polygraph charts better than a human examiner can, it will be time for us to cease the role of examiner and assume the role of operator (food for thought).
IP: Logged |
rnelson Member
|
posted 10-31-2006 11:15 AM
Agreed, J.B. Your point is well taken, and is exactly what I meant about being able to point to the charts as to why we may disagree. At this point, mechanical measurement of polygraph data appears to be highly reliable (it comes out the same every time), but remains fairly blunt regarding the overall quality of the data. Spend some time with the Extract program and you'll see that it evaluates the data at only a couple of specified measurement points. Stoelting literature describes that their software will mark the charts where measurements are taken (I like that), implying again that only a couple of points are used. I think it's tempting to assume they look at a lot more data points than they actually do. Bottom line - computers cannot detect countermeasures. They presently evaluate only data values, and not the overall signature shape of the data. Algorithm scores of bad data are probably more dangerous than worthless - because they may be tempting to use.

Your other point is really a description of the problem of overfitting decision models to a dataset. It is primarily a problem when test developers perform data fitting (feature extraction) and decision threshold setting on the same data. The results will accurately describe the dataset used, but do not always generalize to other datasets and field situations in intended ways. The developers of some algorithms (including, I think, PolyScore) have described how they set aside a portion of their original dataset for later decision fitting. This is a common approach, but does not replace the need for cross-validation with a different sample.

It seems funny to me that some of our more prized tools are so thinly described. But that's partly on us - we have to decide if we want to be simply a market for the proprietary interests of polygraph equipment salespersons, or scientists and professionals who understand our work (and its limitations) and have a strategy for improvement. r

------------------ "Gentlemen, you can't fight in here, this is the war room." --(from Dr. Strangelove, 1964) IP: Logged |
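A minimal sketch of the distinction drawn above: holding out part of the same development dataset guards against gross overfitting, but it is not the same as cross-validation against an independently gathered confirmed sample. All data and weights below are synthetic, for illustration only, and do not represent any actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic development dataset: three standardized channel measurements per
# case, plus confirmed outcomes (1 = deceptive, 0 = truthful).
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, 1.0, 0.5])
y = (X @ true_w + rng.normal(size=200) > 0).astype(int)

# Split the SAME dataset: fit weights on one portion, evaluate on the holdout.
X_fit, y_fit = X[:140], y[:140]
X_hold, y_hold = X[140:], y[140:]

# Simple least-squares weights fitted on the development portion.
w, *_ = np.linalg.lstsq(X_fit, y_fit - 0.5, rcond=None)

def accuracy(features, outcomes, weights):
    predictions = (features @ weights > 0).astype(int)
    return (predictions == outcomes).mean()

print("fit-portion accuracy:    ", accuracy(X_fit, y_fit, w))
print("holdout-portion accuracy:", accuracy(X_hold, y_hold, w))

# Both portions come from the same collection, so the holdout figure still
# describes that collection; true cross-validation requires scoring a
# separate, independently gathered confirmed sample.
```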
Barry C Member
|
posted 10-31-2006 05:31 PM
Oh, you're holding back. I was expecting more. Have any of you read the Raskin and Kircher chapter in Kleiner on computer scoring? (It does address some of your questions, Ray, but it's been a while since I've read it.) We should get Bruce White on board. He's at the point where he believes his software can tell the difference between truth, deception, movements, drift, and CMs - with less data than we collect now. I'm a skeptic, but I enjoy talking to him about it. (At least somebody's trying.) CPS ignores about 2% of the data because it doesn't "look right" to it, which is about the same amount the Utah scorers claim they tend to have to toss, so they'd argue the computer does catch most artifacts a human would. CMs don't appear to work (from the scientific literature anyhow), so they shouldn't "fool" the computer scoring either. APL catches some, but they suggest you don't let it score what you wouldn't. Drift is a different subject, and I'm really interested in seeing how we can deal with that one. This goes back to the "What will polygraph of the future look like?" question. IP: Logged |
rnelson Member
|
posted 11-01-2006 12:57 AM
Thanks, Barry. I just read through it again. It's a good description - a bit heavy on the CPS hype, but they've certainly earned that by doing their homework and documenting their methods as consistent with the state of the science. They describe a preference for using a variance based on the entire dataset of CQs and RQs, compared with the pooled variance method. Much of the content is similar to the 1988 publication. They describe their discriminant analysis in a little less detail, but it is more interesting alongside the little bit of information on logistic regression. I'm more familiar with logistic regression, but in the end it may not matter. Once we are convinced about the features we want to use, logistic regression and discriminant functions both serve to tell us how to weight the parameters. They still don't report any descriptives or parameters on their normative data. Nor do they describe how they map their discriminant score onto the data. But it is not hard to imagine some form of hypothesis test at that point. Where they derive a discriminant function from the differences between a test subject's CQ and RQ mean z-scores, I have been playing with a hypothesis test of the difference between CQs and RQs. Their Bayesian equation at the end is described in 1988.

I think this starts to give insight into one of my questions about how they calculate standard errors. The pooled method is more common, as it respects the independence of the two samples according to the requirements of a t-test for two independent samples. Pooled variance is essentially a kind of weighted variance that attempts to account for differences in sample size. In my experiments the difference between pooled and combined variance is so small it's meaningless and does not affect the calculations - that is not surprising since I've been using hand-scored data with interpolated CQ values.

I'll return to the measurement testing stuff in a bit. I've essentially found what they describe. There are a number of ways to transform and standardize the data - a number of them seem to work, and none offer particular advantages. It makes sense to standardize in ways that are easily recognized and understood by others, which is what they describe with the within-subject z-score transformation. They describe separate transformations for CQs and RQs. I've been using combined transformations, because the relative meaning or magnitude of a component response to a particular question is really only informative in the context of the other questions. Like the R/C ratios that OSS uses, z-scores are simply another method of achieving a dimensionless measurement value with some expected distribution/shape parameter. Not surprisingly, z-scores with interpolated CQ values produce symmetrical values for RQs and CQs.

While I was standing at the bookshelf, I looked in the appendix of the NRC report and found an equation - cool - that describes PolyScore's standardization. Oddly, they subtract the CQ mean from each RQ value (I'd like to know more about why they do it that way) - and they use the pooled standard deviation. It seems like I saw something about detrending to accommodate drift - that's different from filtering at the electronics level prior to the DAC. That CPS ignores 2% of the data doesn't really speak to my point, which is that once features are identified, it is really only a few data-point measurements that are incorporated into measurement scores. I think the APL suggestion not to let it score anything you wouldn't is sound.
I spent part of the afternoon - the part not hassling with Sprint - building a Kolmogorov-Smirnov test to back up my Q-Q plots. As expected, not all data sets satisfy normality for parametric equations, especially when we have all zero or minus scores or all zero and positive scores. Non-parametrics may be a more sound route to significance testing of hand-scored data. So, I think I'm about done with my math, and I think I'll leave all the parametric and nonparametric equations in the spreadsheet. (I still have more work to do on the measurement scoring spreadsheet.) But first, I have some questions for you about two-stage decision rules and parsing separate RQs. Am I correct in thinking that for a single-issue evidentiary test, a minus score for one RQ doesn't cause INC (or DI if -3) as long as the overall score is +6/NDI? Is this the same for two-question and three-question zone tests? I'll have to articulate this more clearly later; right now I'm bug-eyed and brain-dead from all this. r
------------------ "Gentlemen, you can't fight in here, this is the war room." --(from Dr. Strangelove, 1964) IP: Logged | |